Results 1 - 20 of 955
1.
Curr Protoc ; 4(5): e1046, 2024 May.
Article in English | MEDLINE | ID: mdl-38717471

ABSTRACT

Whole-genome sequencing is widely used to investigate population genomic variation in organisms of interest. Assorted tools have been independently developed to call variants from short-read sequencing data aligned to a reference genome, including single nucleotide polymorphisms (SNPs) and structural variations (SVs). We developed SNP-SVant, an integrated, flexible, and computationally efficient bioinformatic workflow that predicts high-confidence SNPs and SVs in organisms without benchmarked variants, which are traditionally used for distinguishing sequencing errors from real variants. In the absence of these benchmarked datasets, we leverage multiple rounds of statistical recalibration to increase the precision of variant prediction. The SNP-SVant workflow is flexible, with user options to trade off accuracy for sensitivity. The workflow predicts SNPs and small insertions and deletions using the Genome Analysis ToolKit (GATK) and predicts SVs using the Genome Rearrangement IDentification Software Suite (GRIDSS), and it culminates in variant annotation using custom scripts. A key utility of SNP-SVant is its scalability. Variant calling is a computationally expensive procedure, and thus, SNP-SVant uses a workflow management system with intermediary checkpoint steps to ensure efficient use of resources by minimizing redundant computations and omitting steps where dependent files are available. SNP-SVant also provides metrics to assess the quality of called variants and converts between VCF and aligned FASTA format outputs to ensure compatibility with downstream tools to calculate selection statistics, which are commonplace in population genomics studies. By accounting for both small and large structural variants, users of this workflow can obtain a wide-ranging view of genomic alterations in an organism of interest. Overall, this workflow advances our capabilities in assessing the functional consequences of different types of genomic alterations, ultimately improving our ability to associate genotypes with phenotypes. © 2024 The Authors. Current Protocols published by Wiley Periodicals LLC. Basic Protocol: Predicting single nucleotide polymorphisms and structural variations. Support Protocol 1: Downloading publicly available sequencing data. Support Protocol 2: Visualizing variant loci using Integrated Genome Viewer. Support Protocol 3: Converting between VCF and aligned FASTA formats.
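
Illustrative sketch: the abstract above describes checkpoint steps that omit work when dependent files are already available. The Python below shows one minimal way such a checkpoint could behave. The function name `run_step`, its parameters, and the file name in the usage note are hypothetical and are not taken from SNP-SVant, which relies on a workflow management system rather than hand-written code like this.

```python
from pathlib import Path
from typing import Callable, Iterable


def run_step(name: str, outputs: Iterable[Path], action: Callable[[], None]) -> bool:
    """Run `action` unless every expected output file already exists.

    Returns True if the step ran and False if it was skipped.
    (Hypothetical helper for illustration only.)
    """
    outputs = list(outputs)
    if outputs and all(p.exists() for p in outputs):
        # Checkpoint hit: all dependent files are present, so skip the work.
        return False
    action()
    missing = [p for p in outputs if not p.exists()]
    if missing:
        raise RuntimeError(f"step {name!r} did not produce: {missing}")
    return True
```

Calling `run_step("call_variants", [Path("calls.vcf")], action)` twice would run `action` the first time and skip it the second, provided the first run wrote `calls.vcf`.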


Subject(s)
Polymorphism, Single Nucleotide , Software , Workflow , Polymorphism, Single Nucleotide/genetics , Computational Biology/methods , Genomics/methods , Molecular Sequence Annotation/methods , Whole Genome Sequencing/methods
2.
Brief Bioinform ; 25(3)2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38706315

ABSTRACT

To date, more than 251 million proteins have been deposited in UniProtKB. However, only 0.25% have been annotated with one of the more than 15000 possible Pfam family domains. The current annotation protocol integrates knowledge from manually curated family domains, obtained using sequence alignments and hidden Markov models. This approach has been successful for automatically growing the Pfam annotations, but at a low rate in comparison to protein discovery. Just a few years ago, deep learning models were proposed for automatic Pfam annotation. However, these models demand a considerable amount of training data, which can be a challenge with poorly populated families. To address this issue, we propose and evaluate here a novel protocol based on transfer learning. This requires the use of protein large language models (LLMs), trained with self-supervision on big unannotated datasets in order to obtain sequence embeddings. Then, the embeddings can be used with supervised learning on a small and annotated dataset for a specialized task. In this protocol we have evaluated several cutting-edge protein LLMs together with machine learning architectures to improve the current prediction of protein domain annotations. Results are significantly better than state-of-the-art for protein family classification, reducing the prediction error by an impressive 60% compared to standard methods. We explain how LLM embeddings can be used for protein annotation in a concrete and easy way, and provide the pipeline in a GitHub repository. Full source code and data are available at https://github.com/sinc-lab/llm4pfam.
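
Illustrative sketch: the second stage described above (supervised learning on precomputed embeddings) can be pictured with a very small classifier. The Python below uses a nearest-centroid rule as a stand-in. It is an assumption chosen for brevity, not the architecture evaluated in the paper, and the function names and toy family labels are hypothetical. The embeddings are assumed to come from a protein LLM that is not shown here.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence

Vector = Sequence[float]


def fit_centroids(embeddings: List[Vector], labels: List[str]) -> Dict[str, List[float]]:
    """Average the embeddings of each family to obtain one centroid per label."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = defaultdict(int)
    for vec, label in zip(embeddings, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        for i, x in enumerate(vec):
            sums[label][i] += x
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}


def predict(centroids: Dict[str, List[float]], vec: Vector) -> str:
    """Assign the family whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(centroids[label], vec))
```

With a handful of labeled embeddings per family, `fit_centroids` builds the reference points and `predict` annotates a new sequence embedding; the small training set is the point of the transfer-learning setup.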


Subject(s)
Databases, Protein , Proteins , Proteins/chemistry , Molecular Sequence Annotation/methods , Computational Biology/methods , Machine Learning
3.
Bioinformatics ; 40(4)2024 Mar 29.
Article in English | MEDLINE | ID: mdl-38640488

ABSTRACT

MOTIVATION: The ENCODE project generated a large collection of eCLIP-seq RNA binding protein (RBP) profiling data with accompanying RNA-seq transcriptomes of shRNA knockdown of RBPs. These data could have utility in understanding the functional impact of genetic variants; however, their potential has not been fully exploited. We implement INCA (Integrative annotation scores of variants for impact on RBP activities) as a multi-step genetic variant scoring approach that leverages the ENCODE RBP data together with ClinVar and integrates multiple computational approaches to aggregate evidence. RESULTS: INCA evaluates variant impacts on RBP activities by leveraging genotypic differences in cell lines used for eCLIP-seq. We show that INCA provides critical specificity, beyond generic scoring for RBP binding disruption, for candidate variants and their linkage-disequilibrium partners. As a result, it can, on average, augment scoring of 46.2% of the candidate variants beyond generic scoring for RBP binding disruption and aid in variant prioritization for follow-up analysis. AVAILABILITY AND IMPLEMENTATION: INCA is implemented in R and is available at https://github.com/keleslab/INCA.


Subject(s)
RNA-Binding Proteins , Humans , RNA-Binding Proteins/metabolism , RNA-Binding Proteins/genetics , Software , Genetic Variation , Computational Biology/methods , Molecular Sequence Annotation/methods
4.
BMC Bioinformatics ; 25(1): 165, 2024 Apr 25.
Article in English | MEDLINE | ID: mdl-38664627

ABSTRACT

BACKGROUND: The annotation of protein sequences in public databases has long posed a challenge in molecular biology. This issue is particularly acute for viral proteins, which demonstrate limited homology to known proteins when using alignment, k-mer, or profile-based homology search approaches. A novel methodology employing Large Language Models (LLMs) addresses this methodological challenge by annotating protein sequences based on embeddings. RESULTS: Central to our contribution is the soft alignment algorithm, drawing from traditional protein alignment but leveraging embedding similarity at the amino acid level to bypass the need for conventional scoring matrices. This method not only surpasses pooled embedding-based models in efficiency but also in interpretability, enabling users to easily trace homologous amino acids and delve deeper into the alignments. Far from being a black box, our approach provides transparent, BLAST-like alignment visualizations, combining traditional biological research with AI advancements to elevate protein annotation through embedding-based analysis while ensuring interpretability. Tests using the Virus Orthologous Groups and ViralZone protein databases indicated that the novel soft alignment approach recognized and annotated sequences that both blastp and pooling-based methods, which are commonly used for sequence annotation, failed to detect. CONCLUSION: The embeddings approach shows the great potential of LLMs for enhancing protein sequence annotation, especially in viral genomics. These findings present a promising avenue for more efficient and accurate protein function inference in molecular biology.
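
Illustrative sketch: the abstract above describes an alignment that scores residue pairs by embedding similarity instead of a conventional scoring matrix. The Python below shows that idea with a Smith-Waterman-style local alignment over per-residue embedding vectors. It is a simplified assumption, not the authors' algorithm: the function names, the linear gap penalty `gap`, and the similarity `offset` are all hypothetical choices made for illustration.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def soft_local_align(
    x: List[Vector], y: List[Vector], gap: float = 0.5, offset: float = 0.5
) -> Tuple[float, List[Tuple[int, int]]]:
    """Local alignment where the match score is cosine similarity minus `offset`.

    Returns the best score and the aligned (index_in_x, index_in_y) pairs.
    """
    n, m = len(x), len(y)
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0.0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = cosine(x[i - 1], y[j - 1]) - offset
            H[i][j] = max(0.0, H[i - 1][j - 1] + s, H[i - 1][j] - gap, H[i][j - 1] - gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until the score drops to zero.
    pairs: List[Tuple[int, int]] = []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        s = cosine(x[i - 1], y[j - 1]) - offset
        if math.isclose(H[i][j], H[i - 1][j - 1] + s):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif math.isclose(H[i][j], H[i - 1][j] - gap):
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best, pairs
```

Because the traceback returns explicit residue pairs, the result can be displayed position by position, which is the kind of interpretability the abstract contrasts with pooled embeddings.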


Subject(s)
Algorithms , Molecular Sequence Annotation , Sequence Alignment , Molecular Sequence Annotation/methods , Sequence Alignment/methods , Viral Proteins/genetics , Viral Proteins/chemistry , Genes, Viral , Databases, Protein , Computational Biology/methods , Amino Acid Sequence
5.
Nat Methods ; 21(5): 793-797, 2024 May.
Article in English | MEDLINE | ID: mdl-38509328

ABSTRACT

SQANTI3 is a tool designed for the quality control, curation and annotation of long-read transcript models obtained with third-generation sequencing technologies. Leveraging its annotation framework, SQANTI3 calculates quality descriptors of transcript models, junctions and transcript ends. With this information, potential artifacts can be identified and replaced with reliable sequences. Furthermore, the integrated functional annotation feature enables subsequent functional iso-transcriptomics analyses.


Subject(s)
Molecular Sequence Annotation , Transcriptome , Humans , Molecular Sequence Annotation/methods , Software , Gene Expression Profiling/methods , Sequence Analysis, RNA/methods , Protein Isoforms/genetics , High-Throughput Nucleotide Sequencing/methods
6.
J Mol Biol ; 436(4): 168416, 2024 02 15.
Article in English | MEDLINE | ID: mdl-38143020

ABSTRACT

Neuropeptides work not only through the nervous system; some of them also act peripherally. They are important in the regulation of numerous physiological processes, including growth, reproduction, social behavior, inflammation, fluid homeostasis, cardiovascular function, and energy homeostasis. The various roles of neuropeptides make them promising candidates for prospective therapeutics of different diseases. NeuroPep has now been updated to version 2.0 and holds 11,417 unique neuropeptide entries, nearly double the number in the first version of NeuroPep. When available, we collected information about the receptor for each neuropeptide entry and predicted the 3D structures of those neuropeptides without known experimental structure using AlphaFold2 or APPTEST according to the peptide sequence length. In addition, DeepNeuropePred and NeuroPred-PLM, two neuropeptide prediction tools developed by us recently, were also integrated into NeuroPep 2.0 to facilitate the identification of new neuropeptides. NeuroPep 2.0 is freely accessible at https://isyslab.info/NeuroPepV2/.


Subject(s)
Databases, Protein , Molecular Sequence Annotation , Neuropeptides , Amino Acid Sequence , Neuropeptides/chemistry , Molecular Sequence Annotation/methods
7.
J Biol Chem ; 299(9): 105130, 2023 09.
Article in English | MEDLINE | ID: mdl-37543366

ABSTRACT

Long noncoding RNAs (lncRNAs) are increasingly being recognized as modulators in various biological processes. However, due to their low expression, their systematic characterization is difficult to determine. Here, we performed transcript annotation by a newly developed computational pipeline, termed RNA-seq and small RNA-seq combined strategy (RSCS), in a wide variety of cellular contexts. Thousands of high-confidence potential novel transcripts were identified by the RSCS, and the reliability of the transcriptome was verified by analysis of transcript structure, base composition, and sequence complexity. Evidenced by the length comparison, the frequency of the core promoter and the polyadenylation signal motifs, and the locations of transcription start and end sites, the transcripts appear to be full length. Furthermore, taking advantage of our strategy, we identified a large number of endogenous retrovirus-associated lncRNAs, and a novel endogenous retrovirus-lncRNA that was functionally involved in control of Yap1 expression and essential for early embryogenesis was identified. In summary, the RSCS can generate a more complete and precise transcriptome, and our findings greatly expanded the transcriptome annotation for the mammalian community.


Subject(s)
Molecular Sequence Annotation , RNA, Long Noncoding , RNA-Seq , Animals , Embryonic Development/genetics , Mammals/embryology , Mammals/genetics , Molecular Sequence Annotation/methods , Promoter Regions, Genetic/genetics , Reproducibility of Results , Retroviridae/genetics , RNA, Long Noncoding/genetics , RNA-Seq/methods , Transcription Initiation Site , Transcriptome/genetics , YAP-Signaling Proteins/genetics , YAP-Signaling Proteins/metabolism
8.
Genome Biol ; 24(1): 135, 2023 06 08.
Article in English | MEDLINE | ID: mdl-37291671

ABSTRACT

BACKGROUND: In every living species, the function of a protein depends on its organization of structural domains, and the length of a protein is a direct reflection of this. Because every species evolved under different evolutionary pressures, the protein length distribution, much like other genomic features, is expected to vary across species but has so far been scarcely studied. RESULTS: Here we evaluate this diversity by comparing protein length distribution across 2326 species (1688 bacteria, 153 archaea, and 485 eukaryotes). We find that proteins tend to be on average slightly longer in eukaryotes than in bacteria or archaea, but that the variation of length distribution across species is low, especially compared to the variation of other genomic features (genome size, number of proteins, gene length, GC content, isoelectric points of proteins). Moreover, most cases of atypical protein length distribution appear to be due to artifactual gene annotation, suggesting the actual variation of protein length distribution across species is even smaller. CONCLUSIONS: These results open the way for developing a genome annotation quality metric based on protein length distribution to complement conventional quality measures. Overall, our findings show that protein length distribution between living species is more uniform than previously thought. Furthermore, we also provide evidence for a universal selection on protein length, yet its mechanism and fitness effect remain intriguing open questions.
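
Illustrative sketch: the comparison described above starts from the protein lengths of each proteome. The Python below shows one minimal way to collect lengths from FASTA-formatted lines and summarize them. The function names and the choice of summary statistics are assumptions for illustration and are not taken from the study.

```python
from statistics import median
from typing import Dict, Iterable


def protein_lengths(fasta_lines: Iterable[str]) -> Dict[str, int]:
    """Map each sequence identifier to its length in residues."""
    lengths: Dict[str, int] = {}
    current = None
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            # Fall back to a generated name when the header is empty.
            current = header[0] if header else f"seq{len(lengths) + 1}"
            lengths[current] = 0
        elif current is not None:
            # Ignore a trailing '*' sometimes used to mark the stop codon.
            lengths[current] += len(line.rstrip("*"))
    return lengths


def summarize(values: Iterable[int]) -> Dict[str, float]:
    """Basic descriptors of a length distribution (requires at least one value)."""
    ordered = sorted(values)
    return {"n": len(ordered), "min": ordered[0], "median": median(ordered), "max": ordered[-1]}
```

Running `summarize(protein_lengths(open(path)).values())` per species would give comparable descriptors; a proteome whose summary deviates strongly from related species could then be inspected for annotation artifacts, as the abstract suggests.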


Subject(s)
Molecular Sequence Annotation , Proteins , Sequence Analysis, Protein , Amino Acid Sequence , Molecular Sequence Annotation/methods , Proteins/chemistry , Proteins/classification , Proteome , Sequence Analysis, Protein/methods , Eukaryota , Bacteria , Archaea
9.
Science ; 380(6643): eabn3107, 2023 04 28.
Article in English | MEDLINE | ID: mdl-37104600

ABSTRACT

Annotating coding genes and inferring orthologs are two classical challenges in genomics and evolutionary biology that have traditionally been approached separately, limiting scalability. We present TOGA (Tool to infer Orthologs from Genome Alignments), a method that integrates structural gene annotation and orthology inference. TOGA implements a different paradigm to infer orthologous loci, improves ortholog detection and annotation of conserved genes compared with state-of-the-art methods, and handles even highly fragmented assemblies. TOGA scales to hundreds of genomes, which we demonstrate by applying it to 488 placental mammal and 501 bird assemblies, creating the largest comparative gene resources so far. Additionally, TOGA detects gene losses, enables selection screens, and automatically provides a superior measure of mammalian genome quality. TOGA is a powerful and scalable method to annotate and compare genes in the genomic era.


Subject(s)
Eutheria , Genomics , Molecular Sequence Annotation , Animals , Female , Mice , Eutheria/genetics , Genome , Genomics/methods , Molecular Sequence Annotation/methods , Birds/genetics
10.
Science ; 379(6639): 1358-1363, 2023 03 31.
Article in English | MEDLINE | ID: mdl-36996195

ABSTRACT

Enzyme function annotation is a fundamental challenge, and numerous computational tools have been developed. However, most of these tools cannot accurately predict functional annotations, such as enzyme commission (EC) number, for less-studied proteins or those with previously uncharacterized functions or multiple activities. We present a machine learning algorithm named CLEAN (contrastive learning-enabled enzyme annotation) to assign EC numbers to enzymes with better accuracy, reliability, and sensitivity compared with the state-of-the-art tool BLASTp. The contrastive learning framework empowers CLEAN to confidently (i) annotate understudied enzymes, (ii) correct mislabeled enzymes, and (iii) identify promiscuous enzymes with two or more EC numbers, functions that we demonstrate by systematic in silico and in vitro experiments. We anticipate that this tool will be widely used for predicting the functions of uncharacterized enzymes, thereby advancing many fields, such as genomics, synthetic biology, and biocatalysis.


Subject(s)
Enzymes , Machine Learning , Molecular Sequence Annotation , Proteins , Sequence Analysis, Protein , Algorithms , Computational Biology , Enzymes/chemistry , Genomics , Proteins/chemistry , Reproducibility of Results , Molecular Sequence Annotation/methods , Sequence Analysis, Protein/methods , Biocatalysis
11.
Sci Rep ; 13(1): 1417, 2023 01 25.
Article in English | MEDLINE | ID: mdl-36697464

ABSTRACT

We report here a new application, CustomProteinSearch (CusProSe), whose purpose is to help users to search for proteins of interest based on their domain composition. The application is customizable. It consists of two independent tools, IterHMMBuild and ProSeCDA. IterHMMBuild allows the iterative construction of Hidden Markov Model (HMM) profiles for conserved domains of selected protein sequences, while ProSeCDA scans a proteome of interest against an HMM profile database, and annotates identified proteins using user-defined rules. CusProSe was successfully used to identify, in fungal genomes, genes encoding key enzyme families involved in secondary metabolism, such as polyketide synthases (PKS), non-ribosomal peptide synthetases (NRPS), hybrid PKS-NRPS and dimethylallyl tryptophan synthases (DMATS), as well as to characterize distinct terpene synthases (TS) sub-families. The highly configurable characteristics of this application make it a generic tool, which allows the user to refine the function of predicted proteins, to extend detection to new enzyme families, and may also be applied to biological systems other than fungi and to other proteins than those involved in secondary metabolism.
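
Illustrative sketch: the abstract above describes annotating proteins from their domain composition using user-defined rules. The Python below shows one way such rules could be expressed as sets of mandatory and forbidden domains. The `Rule` class, the `annotate` function, and the example rules are hypothetical; they do not reproduce the rule syntax or the rule set of ProSeCDA, and the domain hits are assumed to come from a prior HMM scan that is not shown.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Set


@dataclass(frozen=True)
class Rule:
    """A family is assigned when all mandatory domains are present and no forbidden one is."""
    name: str
    mandatory: FrozenSet[str]
    forbidden: FrozenSet[str] = frozenset()


def annotate(domains_by_protein: Dict[str, Set[str]], rules: List[Rule]) -> Dict[str, List[str]]:
    """Return, for each protein, the names of all rules its domain set satisfies."""
    result: Dict[str, List[str]] = {}
    for protein, domains in domains_by_protein.items():
        result[protein] = [
            rule.name
            for rule in rules
            if rule.mandatory <= domains and not (rule.forbidden & domains)
        ]
    return result


# Example rules (illustrative only): ketosynthase (KS) and acyltransferase (AT)
# domains suggest a PKS; adenylation (A) and condensation (C) domains suggest an NRPS.
EXAMPLE_RULES = [
    Rule("PKS", frozenset({"KS", "AT"}), frozenset({"C"})),
    Rule("NRPS", frozenset({"A", "C"}), frozenset({"KS"})),
    Rule("PKS-NRPS", frozenset({"KS", "AT", "A", "C"})),
]
```

Keeping the rules as plain data is what makes this style of annotation configurable: extending detection to another enzyme family means adding a rule, not changing the matching code.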


Subject(s)
Fungi , Molecular Sequence Annotation , Secondary Metabolism , Software , Amino Acid Sequence , Molecular Sequence Annotation/methods , Peptide Synthases/genetics , Polyketide Synthases/genetics , Secondary Metabolism/genetics , Fungi/enzymology , Fungi/genetics , Tryptophan Synthase/genetics , Conserved Sequence/genetics
12.
Nucleic Acids Res ; 50(W1): W57-W65, 2022 07 05.
Article in English | MEDLINE | ID: mdl-35640593

ABSTRACT

The Annotation Query (AnnoQ) (http://annoq.org/) is designed to provide comprehensive and up-to-date functional annotations for human genetic variants. The system is supported by an annotation database with ∼39 million human variants from the Haplotype Reference Consortium (HRC) pre-annotated with sequence feature annotations by WGSA and functional annotations to Gene Ontology (GO) and pathways in PANTHER. The database operates on an optimized Elasticsearch framework to support real-time complex searches. This implementation enables users to annotate data with the most up-to-date functional annotations via simple queries instead of setting up individual tools. A web interface allows users to interactively browse the annotations, annotate variants and search variant data. Its easy-to-use interface and search capabilities are well-suited for scientists with fewer bioinformatics skills such as bench scientists and statisticians. AnnoQ also has an API for users to access and annotate the data programmatically. Packages for programming languages, such as the R package, are available for users to embed the annotation queries in their scripts. AnnoQ serves researchers with a wide range of backgrounds and research interests as an integrated annotation platform.


Subject(s)
Genetic Variation , Molecular Sequence Annotation , Software , Humans , Databases, Genetic , Internet , Molecular Sequence Annotation/methods , User-Computer Interface , Genetic Variation/genetics , Haplotypes/genetics , Programming Languages
13.
Gene ; 807: 145952, 2022 Jan 10.
Article in English | MEDLINE | ID: mdl-34500049

ABSTRACT

Extreme temperature is one of the serious threats to crop production under present and future scenarios of global climate change. Lentil (Lens culinaris) is an important crop, yet there is a serious lack of genetic information regarding its responses to environmental and temperature stresses. This study is the first report evaluating key genes and molecular mechanisms related to temperature stresses in lentil using RNA sequencing. De novo transcriptome assembly created 44,673 contigs, and differential gene expression analysis revealed 7494 differentially expressed genes between the temperature stress and control groups. Basic annotation of the transcriptome assembly generated in our study led to the identification of 2765 novel transcripts that have not yet been identified in lentil genome draft v1.2. In addition, several unigenes involved in mechanisms of temperature sensing, calcium and hormone signaling, and DNA-binding transcription factor activity were identified. Common mechanisms in response to temperature stresses, including proline biosynthesis, balancing of the photosynthetic light reactions, chaperone activity, and circadian rhythms, were also determined from the hub genes through protein-protein interaction network analysis. Deciphering the mechanisms of extreme temperature tolerance would open a new way to develop crops with enhanced plasticity against climate change. In general, this study has identified a set of mechanisms and various genes related to cold and heat stresses, which will be useful for better understanding lentil's reaction to temperature stresses.


Subject(s)
Lens Plant/growth & development , Lens Plant/genetics , Stress, Physiological/genetics , Climate Change , Cold Temperature/adverse effects , Cold-Shock Response/genetics , Crops, Agricultural/genetics , Gene Expression Profiling/methods , Gene Expression Regulation, Plant/genetics , Heat-Shock Response/genetics , Heat-Shock Response/physiology , Hot Temperature/adverse effects , Molecular Sequence Annotation/methods , Photosynthesis , Protein Interaction Maps/genetics , Temperature , Transcriptome/genetics
14.
Gene ; 808: 145996, 2022 Jan 15.
Article in English | MEDLINE | ID: mdl-34634440

ABSTRACT

Russula griseocarnosa is a well-known ectomycorrhizal mushroom, which is mainly distributed in southern China. Although several scholars have attempted to isolate and cultivate fungal strains, no accurate method for culture of artificial fruiting bodies has been presented owing to difficulties associated with mycelium growth on artificial media. Herein, we sequenced the R. griseocarnosa genome using second- and third-generation sequencing technologies, followed by de novo assembly of high-throughput sequencing reads, and GeneMark-ES, BLAST, CAZy, and other databases were utilized for functional gene annotation. We also constructed a phylogenetic tree using different species of fungi, and conducted comparative genomics analysis of R. griseocarnosa against its four representative species. In addition, we evaluated the accuracy of one already sequenced genome of R. griseocarnosa based on the internal transcribed spacer (ITS) sequencing of that type of species. The assembly process resulted in identification of 230 scaffolds with a total genome size of 50.67 Mbp. The gene prediction showed that the R. griseocarnosa genome included 14,229 coding sequences (CDs). In addition, 470 RNAs were predicted with 155 transfer RNAs (tRNAs), 49 ribosomal RNAs (rRNAs), 41 small noncoding RNAs (sRNAs), 42 small nuclear RNAs (snRNAs), and 183 microRNAs (miRNAs). The predicted protein sequences of R. griseocarnosa were analyzed to indicate the existence of carbohydrate-active enzymes (CAZymes), and the results revealed that 153 genes encoded CAZymes, which were distributed in 58 CAZyme families. These enzymes included 78 glycoside hydrolases (GHs), 34 glycosyl transferases (GTs), 30 auxiliary activities (AAs), 2 carbohydrate esterases (CEs), 8 carbohydrate-binding modules (CBMs), and only one polysaccharide lyase (PL). Compared with other fungi, R. griseocarnosa had fewer CAZymes, and the number and distribution of CAZymes were similar to other mycorrhizal fungi, such as Tricholoma matsutake and Suillus luteus. Well-defined effector proteins that were associated with mycorrhiza-induced small-secreted proteins (MiSSPs) were not found in R. griseocarnosa, which indicated that there may be some special effector proteins to interact with host plants in R. griseocarnosa. The genome of R. griseocarnosa may provide new insights into the energy metabolism of ectomycorrhizal (ECM) fungi, a reference to study ecosystem and evolutionary diversification of R. griseocarnosa, as well as promoting the study of artificial domestication.


Subject(s)
Basidiomycota/genetics , Basidiomycota/metabolism , Agaricales/genetics , China , Genome, Fungal/genetics , Genomics/methods , Molecular Sequence Annotation/methods , Mycorrhizae/genetics , Mycorrhizae/metabolism , Phylogeny , Whole Genome Sequencing/methods
15.
Genomics Proteomics Bioinformatics ; 20(3): 455-465, 2022 06.
Article in English | MEDLINE | ID: mdl-34954426

ABSTRACT

The genetic basis of human infertility is currently under intensive investigation. However, only a handful of genes have been validated in animal models as disease-causing genes in infertile men. Thus, to better understand the genetic basis of human spermatogenesis and bridge the knowledge gap between humans and other animal species, we constructed FertilityOnline, a database integrating the literature-curated functional genes during spermatogenesis into an existing spermatogenic database, SpermatogenesisOnline 1.0. Additional features, including the functional annotation and genetic variants of human genes, are also incorporated into FertilityOnline. By searching this database, users can browse the functional genes involved in spermatogenesis and instantly narrow down the number of candidates of genetic mutations underlying male infertility in a user-friendly web interface. Clinical application of this database was exemplified by the identification of novel causative mutations in synaptonemal complex central element protein 1 (SYCE1) and stromal antigen 3 (STAG3) in azoospermic men. In conclusion, FertilityOnline is not only an integrated resource for spermatogenic genes but also a useful tool facilitating the exploration of the genetic basis of male infertility. FertilityOnline can be freely accessed at http://mcg.ustc.edu.cn/bsc/spermgenes2.0/index.html.


Subject(s)
DNA Mutational Analysis , Databases, Genetic , Infertility, Male , Molecular Sequence Annotation , Spermatogenesis , Humans , Male , Cell Cycle Proteins/genetics , Infertility, Male/genetics , Molecular Sequence Annotation/methods , Mutation , DNA Mutational Analysis/methods , Spermatogenesis/genetics , Online Systems
16.
PLoS Biol ; 19(12): e3001464, 2021 12.
Article in English | MEDLINE | ID: mdl-34871295

ABSTRACT

The UniProt knowledgebase is a public database for protein sequence and function, covering the tree of life and over 220 million protein entries. Now, the whole community can use a new crowdsourcing annotation system to help scale up UniProt curation and receive proper attribution for their biocuration work.


Subject(s)
Crowdsourcing/methods , Data Curation/methods , Molecular Sequence Annotation/methods , Amino Acid Sequence/genetics , Computational Biology/methods , Databases, Protein/trends , Humans , Literature , Proteins/metabolism , Stakeholder Participation
18.
Genes (Basel) ; 12(10)2021 10 19.
Article in English | MEDLINE | ID: mdl-34681040

ABSTRACT

Functional annotation allows adding biologically relevant information to predicted features in genomic sequences, and it is, therefore, an important procedure of any de novo genome sequencing project. It is also useful for proofreading and improving gene structural annotation. Here, we introduce FA-nf, a pipeline implemented in Nextflow, a versatile computational workflow management engine. The pipeline integrates different annotation approaches, such as NCBI BLAST+, DIAMOND, InterProScan, and KEGG. It starts from a protein sequence FASTA file and, optionally, a structural annotation file in GFF format, and produces several files, such as GO assignments, output summaries of the abovementioned programs and final annotation reports. The pipeline can be broken easily into smaller processes for the purpose of parallelization and easily deployed in a Linux computational environment, thanks to software containerization, thus helping to ensure full reproducibility.


Subject(s)
Genome/genetics , Molecular Sequence Annotation/methods , Software , Chromosome Mapping , Computational Biology , Genomics
19.
PLoS Genet ; 17(10): e1009768, 2021 10.
Article in English | MEDLINE | ID: mdl-34648488

ABSTRACT

Transposable elements (TEs) constitute the majority of flowering plant DNA, reflecting their tremendous success in subverting, avoiding, and surviving the defenses of their host genomes to ensure their selfish replication. More than 85% of the sequence of the maize genome can be ascribed to past transposition, providing a major contribution to the structure of the genome. Evidence from individual loci has informed our understanding of how transposition has shaped the genome, and a number of individual TE insertions have been causally linked to dramatic phenotypic changes. Genome-wide analyses in maize and other taxa have frequently represented TEs as a relatively homogeneous class of fragmentary relics of past transposition, obscuring their evolutionary history and interaction with their host genome. Using an updated annotation of structurally intact TEs in the maize reference genome, we investigate the family-level dynamics of TEs in maize. Integrating a variety of data, from descriptors of individual TEs like coding capacity, expression, and methylation, as well as similar features of the sequence they inserted into, we model the relationship between attributes of the genomic environment and the survival of TE copies and families. In contrast to the wholesale relegation of all TEs to a single category of junk DNA, these differences reveal a diversity of survival strategies of TE families. Together these generate a rich ecology of the genome, with each TE family representing the evolution of a distinct ecological niche. We conclude that while the impact of transposition is highly family- and context-dependent, a family-level understanding of the ecology of TEs in the genome can refine our ability to predict the role of TEs in generating genetic and phenotypic diversity.


Subject(s)
DNA Transposable Elements/genetics , Genome, Plant/genetics , Zea mays/genetics , Ecosystem , Evolution, Molecular , Genome-Wide Association Study/methods , Genomics/methods , Molecular Sequence Annotation/methods , Sequence Analysis, DNA/methods
20.
PLoS Comput Biol ; 17(10): e1009423, 2021 10.
Article in English | MEDLINE | ID: mdl-34648491

ABSTRACT

Segmentation and genome annotation (SAGA) algorithms are widely used to understand genome activity and gene regulation. These algorithms take as input epigenomic datasets, such as chromatin immunoprecipitation-sequencing (ChIP-seq) measurements of histone modifications or transcription factor binding. They partition the genome and assign a label to each segment such that positions with the same label exhibit similar patterns of input data. SAGA algorithms discover categories of activity such as promoters, enhancers, or parts of genes without prior knowledge of known genomic elements. In this sense, they generally act in an unsupervised fashion like clustering algorithms, but with the additional simultaneous function of segmenting the genome. Here, we review the common methodological framework that underlies these methods, review variants of and improvements upon this basic framework, and discuss the outlook for future work. This review is intended for those interested in applying SAGA methods and for computational researchers interested in improving upon them.
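
Illustrative sketch: the abstract above says SAGA algorithms partition the genome and assign a label to each segment. The Python below covers only the final bookkeeping step, collapsing a sequence of per-bin labels into labeled segments with genomic coordinates. The per-bin labels are assumed to come from an unsupervised model fitted to the epigenomic signals, which is not shown here; the function name, the 200 bp default bin size, and the integer labels are hypothetical.

```python
from itertools import groupby
from typing import List, Sequence, Tuple


def bins_to_segments(
    labels: Sequence[int], bin_size: int = 200, start: int = 0
) -> List[Tuple[int, int, int]]:
    """Merge runs of identical per-bin labels into (start, end, label) segments.

    Coordinates are half-open: a segment covers positions start <= p < end.
    """
    segments: List[Tuple[int, int, int]] = []
    pos = start
    for label, run in groupby(labels):
        length = sum(1 for _ in run) * bin_size
        segments.append((pos, pos + length, label))
        pos += length
    return segments
```

The labels carry no biological meaning by themselves; as the abstract notes, categories such as promoters or enhancers are interpreted afterwards from the signal patterns each label exhibits.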


Subject(s)
Algorithms , Chromatin/genetics , Genome/genetics , Genomics/methods , Molecular Sequence Annotation/methods , Chromatin Immunoprecipitation Sequencing , Histone Code , Humans , Protein Binding